BB-624 Retry connection to kafka if backbeat processes started before #2574

Merged
merged 12 commits into from
Nov 6, 2024

Conversation

BourgoisMickael
Contributor

@BourgoisMickael BourgoisMickael commented Nov 4, 2024

  • Suppress AWS SDK warning in stderr
  • Improve container image size from 2.34GB to 1.55GB (layer chown + dockerignore)
REPOSITORY                   TAG                         IMAGE ID       SIZE         BLOB SIZE
scality/backbeat             7.70.15.rc1-nodesvc-base    ea2e98003485    1.6 GiB      491.7 MiB
scality/backbeat             9.4.0.0                     cf4ecf197272    2.3 GiB      1.8 GiB
  • Backport (cherry-pick) fixes from Zenko so that a process crash-exits after a timeout when its Kafka client fails to connect (CRR, Lifecycle, Bucket notification). This also covers the bucket notification destination.
  • Prevent the ProbeServer from crashing on startup by fixing metric handling when the consumer is not set up yet, and by starting the probe server only after the queue components are started.

New behavior: components try to connect for 60s instead of 30s, then crash-exit and restart if the Kafka client still cannot connect, with errors like:

{"name":"Backbeat:QueueProcessor:task","time":1730775546353,"method":"QueueProcessor::task","error":{"origin":"local","message":"broker transport failure","code":-195,"errno":-195,"stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/home/scality/backbeat/node_modules/node-rdkafka/lib/error.js:454:10)\n    at /home/scality/backbeat/node_modules/node-rdkafka/lib/client.js:350:28"},"level":"error","message":"error during queue processor initialization","hostname":"MDM-RING-46789-store-1","pid":594}
{"name":"Backbeat:QueueProcessor:task","time":1730776197182,"error":{"origin":"local","message":"timed out","code":-185,"errno":-185,"stack":"Error: Local: Timed out\n    at Function.createLibrdkafkaError [as create] (/home/scality/backbeat/node_modules/node-rdkafka/lib/error.js:454:10)\n    at /home/scality/backbeat/node_modules/node-rdkafka/lib/client.js:350:28"},"method":"MetricsProducer::setupProducer","level":"error","message":"error starting metrics producer for queue processor","hostname":"MDM-RING-46789-store-2","pid":146}

Missing from S3C-9338 because this usage of the SDK doesn't use the env variable
@bert-e
Contributor

bert-e commented Nov 4, 2024

Hello bourgoismickael,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

Contributor

@nicolas2bert nicolas2bert left a comment

Suggestions:

  • add tests for "Prevent ProbeServer crash on startup by delaying it after the queue components are started".
  • add tests for Consumer service crashes in the event of a consumer error?

bin/queuePopulator.js (resolved)
lib/BackbeatConsumer.js (outdated, resolved)
lib/BackbeatConsumer.js (outdated, resolved)
@BourgoisMickael BourgoisMickael force-pushed the bugfix/BB-624-connection-retry branch from 46b33ab to 13bf58c Compare November 5, 2024 20:06
Prevent hanging indefinitely if the replication status BackbeatConsumer
succeeds in connecting to Kafka but FailedCRRProducer or
ReplayProducer then fails
@BourgoisMickael BourgoisMickael force-pushed the bugfix/BB-624-connection-retry branch from 13bf58c to dbe3cc5 Compare November 5, 2024 20:12
If probed too soon after startup, some CRR components
can crash.
The metric function now checks that the consumer exists before using it.
Also, the probe server is now started after the queue (as in Zenko)
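
As a rough illustration of the two ideas in this commit message (guarding the metric function and ordering startup), here is a small sketch. Every name in it (`QueueProcessorSketch`, `getLagMetric`, `startAll`, and the fake consumer object) is hypothetical and not taken from the Backbeat code.

```javascript
'use strict';

// Hypothetical queue processor: `_consumer` stays null until start() runs.
class QueueProcessorSketch {
    constructor() {
        this._consumer = null;
    }

    start(done) {
        // Stand-in for the real Kafka consumer setup (assumption).
        this._consumer = { getLag: () => 0 };
        done(null);
    }

    // Metric function used by a probe handler: returns null instead of
    // throwing when the consumer has not been set up yet.
    getLagMetric() {
        if (!this._consumer) {
            return null;
        }
        return this._consumer.getLag();
    }
}

// Start the queue component first, and the probe server only afterwards,
// so that probes never run against a half-initialized processor.
function startAll(processor, startProbeServer, done) {
    processor.start(err => {
        if (err) {
            return done(err);
        }
        return startProbeServer(done);
    });
}
```

The guard alone is enough to avoid the crash when a probe arrives early; the ordering just makes that situation less likely in the first place.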
@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@bert-e
Contributor

bert-e commented Nov 6, 2024

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_pull_requests, create_integration_branches

@BourgoisMickael
Contributor Author

Suggestions:

  • add tests for "Prevent ProbeServer crash on startup by delaying it after the queue components are started".
  • add tests for Consumer service crashes in the event of a consumer error?

@nicolas2bert
I added a simple test for the crash on probing. Testing the whole task file, which stops on error, is not easy; I'll try to use the vm module, but it might take some time.

@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
Comment on lines +127 to +128
// if connection to destination fails, process will stop & restart
next => this._destination.init(next),
Contributor Author


About this: it looks like on Zenko this could crash with a "callback already called" error if the destination Kafka is not up on startup.
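
For context, a "callback already called" error typically appears when an init path invokes its callback twice, for example once on a connection error and again on a later success. One common way to guard against that is sketched below. This is not what the PR or Zenko does; `callOnce` and `initDestination` are hypothetical, and the event names `error` and `ready` on the client are assumptions for the illustration.

```javascript
'use strict';

const { EventEmitter } = require('events');

// Hypothetical helper: forwards only the first invocation to `cb`.
function callOnce(cb) {
    let called = false;
    return (...args) => {
        if (called) {
            return;
        }
        called = true;
        cb(...args);
    };
}

// Hypothetical destination init: the client reports its connection outcome
// through events, and either event may arrive after the other.
function initDestination(client, cb) {
    const done = callOnce(cb);
    client.once('error', err => done(err));
    client.once('ready', () => done(null));
}

// Usage: an error followed by a late 'ready' reaches the callback once.
const client = new EventEmitter();
let calls = 0;
initDestination(client, () => { calls += 1; });
client.emit('error', new Error('broker transport failure'));
client.emit('ready');
```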

@scality scality deleted a comment from bert-e Nov 6, 2024
@bert-e
Contributor

bert-e commented Nov 6, 2024

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_pull_requests

@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@BourgoisMickael
Contributor Author

BourgoisMickael commented Nov 6, 2024

Here is an Integration e2e run: https://github.com/scality/Integration/actions/runs/11699734386/job/32582300296

It should not trigger any of the exits implemented in this PR, as all Kafka dependencies are already up in those tests.

@BourgoisMickael
Contributor Author

/create_integration_branches

@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@BourgoisMickael
Contributor Author

/approve

@scality scality deleted a comment from bert-e Nov 6, 2024
@scality scality deleted a comment from bert-e Nov 6, 2024
@bert-e
Contributor

bert-e commented Nov 6, 2024

I have successfully merged the changeset of this pull request
into targeted development branches:

  • ✔️ development/7.10

  • ✔️ development/7.70

  • ✔️ development/8.5

  • ✔️ development/8.6

  • ✔️ development/8.7

The following branches have NOT changed:

  • development/7.4

Please check the status of the associated issue BB-624.

Goodbye bourgoismickael.

The following options are set: approve, create_integration_branches

@bert-e bert-e merged commit 2b05c9f into development/7.10 Nov 6, 2024
2 checks passed
@bert-e bert-e deleted the bugfix/BB-624-connection-retry branch November 6, 2024 11:10